Anchor Points Algorithms for Hamming and Edit Distance

نویسندگان

  • Foto Afrati
  • Anish Das Sarma
  • Anand Rajaraman
  • Semih Salihoglu
  • Jeffrey Ullman
چکیده

Algorithms for computing similarity joins in MapReduce were offered in [2]. Similarity joins ask to find input pairs that are within a certain distance d according to some distance measure. Here we explore the “anchor-points algorithm” of [2]. We continue looking at Hamming distance, and show that the method of that paper can be improved; in particular, if we want to find strings within Hamming distance d, and anchor points are chosen so that every possible input is within Hamming distance k of some anchor point, then it is sufficient to send each input to all anchor points within distance (d/2)+k, rather than d+k as was suggested in the earlier paper. This improves on the communication cost of the MapReduce algorithm, i.e., reduces the amount of data transmitted among machines. Further, the same holds for edit distance, provided inputs all have the same length n and either the length of all anchor points is n − k or the length of all anchor points is n+ k. We then explore the problem of finding small sets of anchor points for edit distance, which also provides an improvement on the communication cost. We give a close-to-optimal technique to extend anchor sets (called “covering codes”) from the k = 1 case to any k. We then give small covering codes that use either a single deletion or a single insertion, or – in one algorithm – two deletions. Discovering covering codes for edit distance is important in its own right, since very little work is known.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Anchor-Points Algorithms for Hamming and Edit Distances Using MapReduce

Algorithms for computing similarity joins in MapReduce were offered in [2]. Similarity joins ask to find input pairs that are within a certain distance d according to some distance measure. Here we explore the “anchor-points algorithm” of [2]. We continue looking at Hamming distance, and show that the method of that paper can be improved; in particular, if we want to find strings within Hamming...

متن کامل

Fast and Simple Computations Using Prefix Tables Under Hamming and Edit Distance

In this article, we introduce a new and simple data structure, the prefix table under Hamming distance, and present two algorithms to compute it efficiently: one asymptotically fast; the other very fast on average and in practice. Because the latter approach avoids the computation of global data structures, such as the suffix array and the longest common prefix array, it yields algorithms much ...

متن کامل

Fast index for approximate string matching

We present an index that stores a text of length n such that given a pattern of length m, all the substrings of the text that are within Hamming distance (or edit distance) at most k from the pattern are reported in O(m+ log log n + #matches) time (for constant k). The space complexity of the index is O(n1+ǫ) for any constant ǫ > 0.

متن کامل

Hardness of Approximate Nearest Neighbor Search

We prove conditional near-quadratic running time lower bounds for approximate Bichromatic Closest Pair with Euclidean, Manhattan, Hamming, or edit distance. Specifically, unless the Strong Exponential Time Hypothesis (SETH) is false, for every δ > 0 there exists a constant ε > 0 such that computing a (1 + ε)-approximation to the Bichromatic Closest Pair requires Ω (

متن کامل

Approximating optimal solution structure with edit distance and its applications

An alternative notion of approximation arising in cognitive psychology, bioinformatics and linguistics is that of computing a solution which is structurally close to an optimal one. That is, an approximate solution is considered good if its distance from an optimal solution is small, for a distance measure such as Hamming distance or edit distance. There has been a modicum of work on approximat...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013